AWS Glue Tutorial for Beginners - Part I
SOURCE: TheBatu81
3 years ago
TL;DR: if you want to move data from one AWS data source to another, run some calculations on that data before you transfer it, and (maybe?) schedule it so it runs periodically, then AWS Glue lets you easily put together workflows that run a sequence of simple to advanced custom jobs. I believe there are also ways to connect to other sources, but for this tutorial we will stick with AWS sources.

In this first part of the tutorial, I will explain each of the components you need to interact with in order to put together a scheduled workflow to move data from AWS DynamoDB to AWS RDS MySQL. In the next part, we will put together a sample project that moves the new data daily. If there is interest, I could add a part 3 to go over how to handle duplicate data issues.


Here is how Amazon AWS defines their Glue service:

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.

Data integration is the process of preparing and combining data for analytics, machine learning, and application development. It involves multiple tasks, such as discovering and extracting data from various sources; enriching, cleaning, normalizing, and combining data; and loading and organizing data in databases, data warehouses, and data lakes. These tasks are often handled by different types of users that each use different products.

AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows with a few clicks in AWS Glue Studio.
As with most things, AWS likes to complicate things and put a lot of options in front of you, assuming you will figure it out eventually. While that is true, it does waste time. With just the most basic components, you should be able to get a production ETL process going... I have!

I am currently running this process for my team and everyone seems to like it, from the Engineers to the Data Scientists. It is very flexible: while you can start off with a few components to get a production ETL process, it provides enough resources to build very advanced, complex workflows.

To move data from DynamoDB to MySQL, you need:

- 1 Glue database
- 1 custom JDBC connection for MySQL
- 2 crawlers, 1 for DynamoDB and 1 for MySQL
- At least 1 basic job
- A trigger for that job
- OR... if you have multiple jobs that need to execute in order, a workflow to group your jobs

That's it!

They also have Schemas, Schema Registries, Settings, Blueprints, Dev Endpoints, Notebooks, Security Configurations, etc.... which you DON'T need even for a complex, advanced ETL process. They could help, but they are expensive and I haven't found a need for them so far.

First Step... Create Database

This sounds more complicated than it actually is. All you do is define a name for the database and click Create! That's it! You now have a Glue database.
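
If you prefer to script this step instead of clicking through the console, the same thing can be done with boto3. Here is a minimal sketch where the database name and region are just examples:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Creates an empty Glue catalog database; the crawlers will fill it with
# table definitions later.
glue.create_database(
    DatabaseInput={
        "Name": "my_etl_database",  # example name
        "Description": "Catalog database for the DynamoDB -> MySQL pipeline",
    }
)
```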
Second Step... Connection

While DynamoDB doesn't require any custom specifications, AWS RDS requires a custom connection to be defined before you can discover the table schemas in Glue. When you add the connection, you also define the username and password you want the connection to use, as well as any organizational VPC settings.
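
For reference, the same connection can be defined with boto3. Everything below (the connection name, JDBC URL, credentials, and VPC details) is a placeholder you would swap for your own, and in a real setup you would pull the password from something like Secrets Manager rather than hard-coding it:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Defines the JDBC connection Glue will use to reach the RDS MySQL instance.
glue.create_connection(
    ConnectionInput={
        "Name": "rds-mysql-connection",  # placeholder name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://my-rds-endpoint:3306/mydb",
            "USERNAME": "glue_user",
            "PASSWORD": "change-me",  # placeholder; prefer Secrets Manager
        },
        # Needed so Glue can reach an RDS instance that lives inside your VPC.
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```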
Third Step... Crawlers

There are 2 ways you can define tables in AWS Glue: one way is to create a table and all of its schema by hand, and the other is to create a crawler that will automatically discover your table schemas and update them in the Glue database for you. This way, if you have a database that changes a lot, or you are planning on modifying the schema a lot, you can simply run a crawler and it will update the table definitions within Glue for you.
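
If you would rather create the two crawlers in code than in the console, a boto3 sketch along these lines should do it. The IAM role, database name, DynamoDB table, and connection name are all placeholders:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Crawler 1: discovers the schema of the DynamoDB source table.
glue.create_crawler(
    Name="dynamodb-source-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",  # placeholder role
    DatabaseName="my_etl_database",
    Targets={"DynamoDBTargets": [{"Path": "my-dynamodb-table"}]},
)

# Crawler 2: discovers the schema of the MySQL target through the JDBC connection.
glue.create_crawler(
    Name="mysql-target-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="my_etl_database",
    Targets={
        "JdbcTargets": [
            {"ConnectionName": "rds-mysql-connection", "Path": "mydb/%"}
        ]
    },
)

# Run them once on demand; you can also put them on a schedule.
glue.start_crawler(Name="dynamodb-source-crawler")
glue.start_crawler(Name="mysql-target-crawler")
```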

An important thing with crawlers is that if you are crawling an S3 source, AWS provides an out-of-the-box feature called bookmarks to make sure your crawler only discovers new data rather than the entire S3 contents each time it runs. You also have to use bookmarks in your jobs if you want your scheduled jobs to only pick up new data since the last run. However, neither DynamoDB nor MySQL provides bookmarks, so you have to write custom jobs to take care of that issue.
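
For context, bookmarks show up inside a job script as the transformation_ctx argument, combined with running the job with the --job-bookmark-option=job-bookmark-enable parameter. Here is a minimal sketch of the standard Glue job boilerplate with a bookmarked S3-backed catalog source; the database and table names are placeholders:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # bookmark state is tracked per job

# transformation_ctx is the key Glue uses to remember what this source already read
# (only effective when the job runs with --job-bookmark-option=job-bookmark-enable).
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_etl_database",   # placeholder
    table_name="my_s3_table",     # placeholder; an S3-backed catalog table
    transformation_ctx="source",
)

# ... transformations and the write would go here ...

job.commit()  # persists the bookmark state for the next run
```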

Crawlers ONLY define your table schemas; they do not store data and they do not move data. The job that you will define in the next step is what will move your data from one source to another.
Fourth Step... Jobs

AWS Glue provides a visual tool to easily put together a job, gives you the flexibility to turn that into a script, and lets you edit the script. There are a few catches, but overall, jobs are the most powerful aspect of AWS Glue. When you create a new job, you have the option to start with Visual mode with a pre-filled source/target, blank Visual mode, the Spark script editor, the Python shell script editor, or the new addition, Jupyter Notebook (which I know nothing about).

When you select the blank visual editor, you have 3 main building blocks: your source data connections, the transformations you would like to apply to the source data, and a target data connection.

As you can see from the images below, there are a lot of options, and you can explore them all yourself. It's fun! You can pretty much do anything a data scientist needs within this framework. You can run your own Python scripts if you want to use that fancy merge function pandas provides, or anything else to your heart's content. It supported everything I tried to do with it.

As you can see from the last image below, you can chain together transformations one after another and do some magical stuff with your data before transferring it to its resting place :)
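
To make that concrete, here is roughly what the script behind a simple job looks like once you switch from the visual editor to the script view: read the DynamoDB table from the catalog, remap a few columns, and write to the crawled MySQL table. All database, table, and column names below are placeholders, and the exact script Glue generates for you will differ:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: the DynamoDB table as discovered by the first crawler.
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_etl_database",
    table_name="my_dynamodb_table",
    transformation_ctx="source",
)

# Transform: keep, rename, and retype only the columns the MySQL table expects.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("id", "string", "id", "string"),
        ("created_at", "string", "created_at", "timestamp"),
        ("amount", "double", "amount", "double"),
    ],
    transformation_ctx="mapped",
)

# Target: the MySQL table as discovered by the second crawler (written over JDBC).
glue_context.write_dynamic_frame.from_catalog(
    frame=mapped,
    database="my_etl_database",
    table_name="mydb_my_target_table",
    transformation_ctx="target",
)

job.commit()
```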

Once you create your job, you have the option to schedule it. This automatically creates a trigger for you. If you do not want to schedule it, you can also create triggers manually to run the job. You can also create workflows, chain up jobs, and schedule a trigger for that workflow to execute however frequently you wish.
Fifth Step... Triggers

Like I said in the previous step, if you only have 1 job to run, then you can easily schedule it within the properties of the job itself. However, if you don't want to set the trigger from there, there is a Triggers tab that will let you schedule any job or workflow.
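
If you would rather create the trigger in code than in the Triggers tab, a boto3 sketch looks like this; the trigger name, job name, and cron expression are placeholders:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# A scheduled trigger that starts the job every day at 05:00 UTC.
glue.create_trigger(
    Name="daily-dynamodb-to-mysql",   # placeholder name
    Type="SCHEDULED",
    Schedule="cron(0 5 * * ? *)",     # AWS cron syntax
    Actions=[{"JobName": "dynamodb-to-mysql-job"}],  # placeholder job
    StartOnCreation=True,             # activate it immediately
)
```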
Sixth And Final Step... Workflows

This is optional, for when you have complicated jobs that require some sort of order to run. If you have 50 jobs but want the other 49 not to run if the first one fails, you can set up very nice and convenient diagram flows. One thing I didn't like is that the AWS Glue UI is still very primitive and can be frustrating when it does stuff you don't expect :)

To add a workflow, you first define its basic details. After that, you have to add a trigger for it. It can be either scheduled or on-demand, which means you can have workflows that you only trigger manually in case something goes wrong. Pretty neat!

Once you add the workflow and the trigger, you can chain as many jobs as you want. You indicate whether you want each node to wait for the previous node or execute immediately.
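
The same chaining can be expressed with boto3 if you prefer code: create the workflow, attach a scheduled trigger that starts the first job, and add a conditional trigger that starts the next job only after the first one succeeds. All names below are placeholders:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_workflow(Name="daily-etl-workflow",
                     Description="DynamoDB -> MySQL pipeline")

# Start node: kicks off the first job on a daily schedule.
glue.create_trigger(
    Name="start-daily-etl",
    WorkflowName="daily-etl-workflow",
    Type="SCHEDULED",
    Schedule="cron(0 5 * * ? *)",
    Actions=[{"JobName": "extract-job"}],   # placeholder job
    StartOnCreation=True,
)

# Second node: only runs when the first job finishes successfully.
glue.create_trigger(
    Name="after-extract",
    WorkflowName="daily-etl-workflow",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": "extract-job", "State": "SUCCEEDED"}
        ],
    },
    Actions=[{"JobName": "load-job"}],      # placeholder job
    StartOnCreation=True,
)
```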
And that is basically it. Once you do all the steps above, you will have defined all your source and target data definitions, defined your ETL process, and scheduled it to run every day :)

In part 2... if I ever have time, I will put together a sample project that moves data from one source to another and does some cool transformations to it.
AWS Glue, Engineering, Tutorial, ETL, Amazon, DynamoDB, Mysql, RDS